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Abstract: We develop a clustering framework for observations from a population 
with a smooth probability distribution function and derive its asymptotic properties. 
A clustering criterion based on a linear combination of order statistics is proposed. 
The asymptotic behavior of the point at which the observations are split into two 
clusters is examined. The results obtained can then be utilized to construct an 
interval estimate of the point which splits the data and develop tests for bimodality 
and presence of clusters.

1. Introduction

In this article, we develop a general framework for univariate clustering based on the 
ideas in Hartigan (1978) for the case of observations from a population with smooth and 
invertible distribution function. Contrary to Hartigan's approach, which was based on a 
quadratic function of the observed data, our clustering criterion function possesses the 
advantage of being a linear combination of order statistics — in fact, it is a combination 
of trimmed sums and sample quantiles. 

It is common in certain applications to assume that the data are taken from a popula- 
tion with smooth distribution function. One important example is modeling in continuous- 
time mathematical finance, wherein observations are typically increments from a continuous- 
time stochastic process, and therefore, have smooth distributions because of the presence of
Itô integral components. Keeping this in mind, we deviate from Hartigan's framework
and concentrate our attention on a function of the derivative of his split function.
This approach permits us to obviate the existence of a finite fourth moment assumption 
imposed by Hartigan in the asymptotic investigation of his criterion function — a second 
moment assumption at the cost of an additional smoothness condition on our criterion 
function suffices. As an added benefit, this modification of Hartigan's approach provides 
us with the genuine possibility of extending our existing theory to more interesting sce- 
narios involving dependent observations. 

The notion of a "cluster" has several reasonable mathematical definitions. As in Har- 
tigan (1978), we adopt a definition based on determining a point which splits the data 
into clusters via maximizing the between cluster sums of squares. The main results in this 
article involve the asymptotic behavior of this particular point. The theoretical properties
of the k-means clustering procedure for the univariate and the multivariate cases have
been extensively investigated. Pollard (1981, 1982) proved strong consistency
and asymptotic normality results in the univariate case. Serinko and Babu (1992) proved
some weak limit theorems under non-regular conditions for the univariate case. With
the intention of having a more robust procedure for clustering, García-Escudero, Gordaliza
and Matrán (1999), and Cuesta-Albertos, Gordaliza and Matrán (1997) propose the
trimmed k-means clustering and provide a central limit theorem for the multivariate case.
Throughout the article we are primarily concerned with the case k = 2 on the real line. For
extension of the split point approach to the case k > 2 we refer readers to the discussion 
in Hartigan (1978). 

On a more practical note, our results enable us to construct an interval estimate of the 
point at which the data splits; this naturally allows us to develop simple tests for bimodality
and presence (or absence) of clusters. Hypothesis tests for the presence (or absence) of
clusters in a dataset have attracted considerable interest over the years. One of the earliest
works in this area was by Engelman and Hartigan (1969); they developed a univariate
method to test the null hypothesis of a normally distributed cluster against the alternative
of a two-component mixture of normals. Wolfe (1970) extended the work of Engelman and
Hartigan (1969) to the multivariate normal setup using MLE techniques and applied his 
method to Fisher's Iris data. Motivated by applications in market segmentation, Arnold 
(1979) proposed a test for clusters based on examining the within-groups scatter matrix. 
A dataset generated from a large-scale survey of lifestyle statements was considered and 
the objective was to capture heterogeneity in the distribution of the responses to an ap- 
propriate questionnaire. Based on statistics concerning mean distances, minimum-within 
clusters sums and the resulting F-statistics, Bock (1985) presented several significance 
tests for clusters. In fact, he generalized some of the results in Hartigan (1978) to the 
multivariate setup. In similar spirit, we note that in our method, the point which splits 
the data is invariant to scaling and translation of the data. This permits us to examine 
the behavior of the point under the null hypothesis of "no cluster" and thereby construct 
a suitable test. 

In Hartigan and Hartigan (1985), a popular test for unimodality, referred to as the dip
test, was proposed and applied to a dataset pertaining to the quality of 63 statistics
departments. Indeed, their test did not possess good power against the specific bimodal
alternative. More recently, Holzmann and Vollmer (2008) proposed a parametric test for 
bimodality based on the likelihood principle by using two-component mixtures. Their 
method was applied to investigate the modal structure of the cross-sectional distribu- 
tion of per-capita log GDP across EU regions. Using the Kolmogorov-Smirnov and the 
Anderson-Darling statistics, Schwab, Podsiadlowski and Rappaport (2012), performed a 
test for bimodality in the distribution of neutron-star masses. They compared the empir- 
ical cumulative distribution function to the distribution functions of a unimodal normal 
and a bimodal two-component normal mixture. Our results enable us to construct, on 
identical lines as the test for clusters, a test for bimodality. The test statistic, again, is 
based on the point at which the data is split — the split point is the same for all unimodal 
distributions with finite second moment and can hence be used as the test statistic. 

In section 2 we introduce the relevant constructs of our clustering framework: a the- 
oretical criterion function and its zero followed by the empirical criterion function and 
its "zero". These quantities are of chief interest in this article. In section 3, we prove 
limit theorems for the empirical zero by examining the asymptotic behavior of the empirical
criterion function and offer numerical verification of the limit results via simulation.
Furthermore, we demonstrate the utility of our results on the popular faithful dataset
pertaining to eruption times for the Old Faithful geyser in Yellowstone National Park, 
Wyoming, USA. Finally, in section 4 we highlight the salient features of our approach, 
note its shortcomings and comment on possible remedies and extensions. 

2. Clustering Criterion 

In this section, based on Hartigan's approach, we propose an alternative clustering crite- 
rion and examine its properties. We first state our assumptions for the rest of the paper. 

2.1. Assumptions 

Let $W_1, W_2, \ldots, W_n$ be i.i.d. random variables with cumulative distribution function $F$.
We denote by $Q$ the quantile function associated with $F$. We make the following assumptions:

A1. $F$ is invertible for $0 < p < 1$ and absolutely continuous with respect to Lebesgue
measure with density $f$.

A2. $E(W_1) = 0$ and $E(W_1^2) = 1$.

A3. $Q$ is twice continuously differentiable at any $0 < p < 1$.

Note that owing to assumption A1, the quantile function $Q$ is the regular inverse of $F$
and not the generalized inverse.

2.2. Empirical Cross-over Function and Empirical Split Point 

Let us first consider the split function that was introduced in Hartigan (1978) for partitioning
a sample into two groups. The split function of $Q$ at $p \in (0, 1)$ is defined as

$$B(Q, p) = p\,(Q_l(p))^2 + (1 - p)\,(Q_u(p))^2 - \left(\int_0^1 Q(q)\,dq\right)^2, \tag{2.1}$$

where

$$Q_l(p) = \frac{1}{p}\int_{q<p} Q(q)\,dq = \frac{1}{p}\,E\left[W_1 I_{W_1 < Q(p)}\right]$$

and

$$Q_u(p) = \frac{1}{1-p}\int_{q>p} Q(q)\,dq = \frac{1}{1-p}\,E\left[W_1 I_{W_1 > Q(p)}\right]$$

represent the conditional expectations of the random variable $W_1$ up to and from $Q(p)$.
Here $I_A$ denotes the indicator function of a set $A$. In our case, since $EW_1 = 0$, the last
term in the definition of the split function is 0. A value $p_0$ which maximizes the split
function is called the split point. It is seen that if $Q$ is the regular inverse, as in our case,
$p_0$ satisfies the equation

$$(Q_u(p_0) - Q_l(p_0))\,[Q_u(p_0) + Q_l(p_0) - 2Q(p_0)] = 0, \tag{2.2}$$

where the LHS is the derivative of $B(Q, p)$. Evidently, $Q_u(p) - Q_l(p) > 0$ for all $0 < p < 1$,
and hence, for our purposes, we consider the cross-over function,

$$G(p) = Q_l(p) + Q_u(p) - 2Q(p), \tag{2.3}$$

for examining clustering properties. From a statistical perspective, we would like to work 
with the empirical version of (2.3). We deviate here from Hartigan's framework and 
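consider the empirical cross-over function (ECF), defined in Bharath, Pozdnyakov and
Dey (2012) as

$$G_n(p) = \frac{1}{k}\sum_{j=1}^{k} W_{(j)} - W_{(k)} + \frac{1}{n-k}\sum_{j=k+1}^{n} W_{(j)} - W_{(k+1)}, \tag{2.4}$$

for $\frac{k-1}{n} \le p < \frac{k}{n}$, and

$$G_n(p) = \frac{1}{n}\sum_{j=1}^{n} W_{(j)} - W_{(n)}, \tag{2.5}$$

for $\frac{n-1}{n} \le p < 1$, where $1 \le k \le n - 1$ and $W_{(1)} \le \cdots \le W_{(n)}$ denote the order statistics.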

The random quantity $G_n$ represents the empirical version of (2.3) and determines the
split point for the given data. $G_n$ is an L-statistic with irregular weights and hence is not
amenable to direct application of existing asymptotic results for L-statistics. Observe
that

$$G_n(0) = W_{(1)} - W_{(1)} + \frac{1}{n-1}\sum_{j=2}^{n} W_{(j)} - W_{(2)} > 0,$$

$$G_n\!\left(\frac{n-1}{n}\right) = \frac{1}{n}\sum_{j=1}^{n} W_{(j)} - W_{(n)} < 0.$$

This simple observation captures the typical behavior of the empirical cross-over function.
It starts positive and then at some point crosses the zero line. The index $k$ at which
this change occurs determines the datum at which the split occurs. In Bharath,
Pozdnyakov and Dey (2012), it is shown that $G_n(p)$ is a consistent estimator of $G(p)$ for
each $0 < p < 1$ and also that $\sqrt{n}(G_n(p) - G(p))$ is asymptotically normal.
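As a minimal sketch of how (2.4) can be evaluated in practice, the following Python function computes $G_n((k-1)/n)$ for $k = 1, \ldots, n-1$ from prefix sums of the order statistics. The function name `ecf_values` is hypothetical and is not taken from the paper or from Bharath, Pozdnyakov and Dey (2012).

```python
from itertools import accumulate


def ecf_values(data):
    """Return [G_n((k-1)/n) for k = 1, ..., n-1], following (2.4).

    Hypothetical helper for illustration only.
    """
    w = sorted(data)                      # order statistics W_(1) <= ... <= W_(n)
    n = len(w)
    if n < 2:
        raise ValueError("need at least two observations")
    prefix = [0.0] + list(accumulate(w))  # prefix[k] = W_(1) + ... + W_(k)
    total = prefix[n]
    values = []
    for k in range(1, n):
        lower_mean = prefix[k] / k                  # mean of the k smallest values
        upper_mean = (total - prefix[k]) / (n - k)  # mean of the n - k largest values
        # w[k - 1] is W_(k) and w[k] is W_(k+1) because Python lists are 0-based
        values.append(lower_mean - w[k - 1] + upper_mean - w[k])
    return values
```

On a sample with two well-separated groups the returned sequence typically starts positive and ends negative, which is the behavior described above.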

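We now introduce the empirical split point in the range $[a, b]$, $0 < a < b < 1$, the empirical
counterpart of $p_0$, as

$$p_n(a, b) := \begin{cases}
0, & \text{if } G_n\!\left(\frac{k-1}{n}\right) < 0 \ \ \forall k \text{ such that } na \le k \le nb + 1; \\
1, & \text{if } G_n\!\left(\frac{k-1}{n}\right) > 0 \ \ \forall k \text{ such that } na \le k \le nb + 1; \\
\frac{1}{n}\max\left\{na \le k \le nb : G_n\!\left(\frac{k-1}{n}\right) G_n\!\left(\frac{k}{n}\right) \le 0\right\}, & \text{otherwise.}
\end{cases}$$

The quantity $p_n$ is our estimator of $p_0$, the true split point (when it is in the range). If
$p_n$ is equal to 0 or 1, we declare that the split point is outside the range. The asymptotic
behavior of $p_n$ can be used for the construction of a test for the presence of clusters in the
observations, or for the estimation of the true split point.

Remark 1. Let us provide some intuition behind Hartigan's split function. The k-means
clustering method for the case $k = 2$ requires us to minimize (with respect to $k^*$) the
following within-group sum of squares:

$$\sum_{i=1}^{k^*}\left(W_{(i)} - \frac{1}{k^*}\sum_{j=1}^{k^*} W_{(j)}\right)^2 + \sum_{i=k^*+1}^{n}\left(W_{(i)} - \frac{1}{n-k^*}\sum_{j=k^*+1}^{n} W_{(j)}\right)^2
= \sum_{i=1}^{n} W_{(i)}^2 - \frac{1}{k^*}\left(\sum_{i=1}^{k^*} W_{(i)}\right)^2 - \frac{1}{n-k^*}\left(\sum_{i=k^*+1}^{n} W_{(i)}\right)^2.$$

That is, minimizing the within-group sum of squares is equivalent to maximizing

$$\frac{1}{k^*}\left(\sum_{i=1}^{k^*} W_{(i)}\right)^2 + \frac{1}{n-k^*}\left(\sum_{i=k^*+1}^{n} W_{(i)}\right)^2$$

or

$$\frac{k^*}{n}\left(\frac{1}{k^*}\sum_{i=1}^{k^*} W_{(i)}\right)^2 + \frac{n-k^*}{n}\left(\frac{1}{n-k^*}\sum_{i=k^*+1}^{n} W_{(i)}\right)^2,$$

which is basically an empirical version of Hartigan's split function (2.1), and $k^*/n$ will be
another version of the empirical split point.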
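The two notions of an empirical split point discussed above can be compared numerically. The sketch below is an illustration under stated assumptions: it reads the range conditions in the definition of $p_n(a, b)$ as non-strict inequalities, and the names `ecf_on_grid`, `empirical_split_point` and `two_means_split` are hypothetical.

```python
import math
from itertools import accumulate


def ecf_on_grid(w):
    """For sorted w, return g with g[k-1] = G_n((k-1)/n), k = 1..n ((2.4) and (2.5))."""
    n = len(w)
    prefix = [0.0] + list(accumulate(w))
    g = [prefix[k] / k - w[k - 1] + (prefix[n] - prefix[k]) / (n - k) - w[k]
         for k in range(1, n)]
    g.append(prefix[n] / n - w[n - 1])   # last piece, formula (2.5)
    return g


def empirical_split_point(data, a, b):
    """Empirical split point in [a, b]; 0.0 or 1.0 signal 'outside the range'."""
    w = sorted(data)
    n = len(w)
    g = ecf_on_grid(w)
    lo = max(1, math.ceil(n * a))
    hi = min(n - 1, math.floor(n * b))
    if lo > hi:
        raise ValueError("range [a, b] contains no grid point for this n")
    window = g[lo - 1:hi + 1]            # G_n((k-1)/n) for k = lo, ..., hi + 1
    if all(v < 0 for v in window):
        return 0.0
    if all(v > 0 for v in window):
        return 1.0
    # sign changes (or exact zeros) between consecutive grid points
    crossings = [k for k in range(lo, hi + 1) if g[k - 1] * g[k] <= 0]
    if not crossings:
        raise ValueError("no sign change found in the range")
    return max(crossings) / n


def two_means_split(data):
    """k*/n, where k* maximizes the between-group criterion of Remark 1."""
    w = sorted(data)
    n = len(w)
    prefix = [0.0] + list(accumulate(w))

    def between(k):
        return prefix[k] ** 2 / k + (prefix[n] - prefix[k]) ** 2 / (n - k)

    return max(range(1, n), key=between) / n
```

For a small sample such as `[0.1, 0.4, 0.5, 3.0, 3.2, 3.9, 4.1]` with `a = 0.2` and `b = 0.8`, both functions place the split after the third order statistic; in general the two estimators need not coincide exactly.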

In this paper we proceed in parallel with Hartigan (1978) (his Theorem 1 and Theorem
2) and prove consistency and asymptotic normality of $p_n$ under a uniqueness assumption.
The theoretical conditions that guarantee the uniqueness of the split point remain an open
question. It is easy to see that for a unimodal symmetric distribution with a finite second
moment, the split point is 1/2, and for all unimodal symmetric light-tailed distributions
that we checked the split point was unique. However, Hartigan (1978) gives an example of
a unimodal symmetric heavy-tailed distribution for which every point in (0, 1) is a split
point. For a bimodal distribution the split point $p_0$ is typically unique and $Q(p_0)$ lies
between the cluster means.

The presented results can be employed for testing the "no-clusters" hypothesis, testing
bimodality, and estimation of the split point. The extension of this technique to the
case of $k$ clusters is discussed in Hartigan (1978). In our case, instead of one cross-over
function one needs to introduce $k - 1$ functions; the split point in this case will be a
$(k - 1)$-dimensional vector. To find this split point we then need to solve a system of $k - 1$
equations. For instance, for partition of data into three groups we need to introduce two
cross-over functions

$$G_{1n}\!\left(\frac{k_1 - 1}{n}, \frac{k_2 - 1}{n}\right) = \frac{1}{k_1}\sum_{i=1}^{k_1} W_{(i)} - W_{(k_1)} + \frac{1}{k_2 - k_1}\sum_{i=k_1+1}^{k_2} W_{(i)} - W_{(k_1+1)},$$

$$G_{2n}\!\left(\frac{k_1 - 1}{n}, \frac{k_2 - 1}{n}\right) = \frac{1}{k_2 - k_1}\sum_{i=k_1+1}^{k_2} W_{(i)} - W_{(k_2)} + \frac{1}{n - k_2}\sum_{i=k_2+1}^{n} W_{(i)} - W_{(k_2+1)},$$

and, respectively, one needs to solve (in an appropriate sense) the following system of
equations:

$$G_{1n}\!\left(\frac{k_1 - 1}{n}, \frac{k_2 - 1}{n}\right) = 0, \qquad G_{2n}\!\left(\frac{k_1 - 1}{n}, \frac{k_2 - 1}{n}\right) = 0.$$

We do not address the general $k > 2$ cluster situation in this paper.

Finally, as stated in the Introduction, let us remark on the main technical difference 
between Hartigan's assumptions and ours. Since we deal here with the derivative of the 
split function, we need a stronger smoothness condition (the second derivative of G instead 
of the first one). In return, we work with trimmed means and a weaker moment condition 
suffices (the finite second moment instead of the fourth one as in Hartigan (1978)). 

Remark 2. Notice that if, for constants $\alpha > 0$ and $\beta$ and $i = 1, \ldots, n$,

$$Z_i = \alpha W_i + \beta,$$

and we define $\tilde{G}_n$ to be the ECF based on the $Z_i$, then

$$\tilde{G}_n\!\left(\frac{k-1}{n}\right) = \frac{1}{k}\sum_{j=1}^{k} Z_{(j)} - Z_{(k)} + \frac{1}{n-k}\sum_{j=k+1}^{n} Z_{(j)} - Z_{(k+1)}
= \alpha\left[\frac{1}{k}\sum_{j=1}^{k} W_{(j)} - W_{(k)} + \frac{1}{n-k}\sum_{j=k+1}^{n} W_{(j)} - W_{(k+1)}\right]
= \alpha\,G_n\!\left(\frac{k-1}{n}\right),$$

and therefore, $G_n$ and $\tilde{G}_n$ cross over at the same point. Thus assumption A2 is not
restrictive since $p_n$ is invariant to scaling and translation of the data, as it should be,
since in a clustering problem scale and location changes of the data should not affect the
clustering mechanism. We can hence quite safely assume that we are dealing with random
variables with mean 0 and variance 1.
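The invariance in Remark 2 is easy to check numerically. The following sketch (with the hypothetical helper names `ecf_values` and `max_scaling_error`) compares the ECF of the transformed data with the scaled ECF of the original data; it assumes a positive scale factor so that the ordering of the observations is preserved.

```python
from itertools import accumulate


def ecf_values(data):
    """Return [G_n((k-1)/n) for k = 1, ..., n-1], following (2.4)."""
    w = sorted(data)
    n = len(w)
    prefix = [0.0] + list(accumulate(w))
    return [prefix[k] / k - w[k - 1] + (prefix[n] - prefix[k]) / (n - k) - w[k]
            for k in range(1, n)]


def max_scaling_error(data, alpha, beta):
    """Largest deviation between the ECF of alpha*W + beta and alpha times the ECF of W."""
    if alpha <= 0:
        raise ValueError("alpha must be positive, as in Remark 2")
    g_w = ecf_values(data)
    g_z = ecf_values([alpha * x + beta for x in data])
    return max(abs(gz - alpha * gw) for gz, gw in zip(g_z, g_w))
```

Up to floating-point rounding the returned error is zero, so the sign pattern of the ECF, and hence the empirical split point, is unchanged by the transformation.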



3. Main Results 

Let us start with the functional limit theorem for $U_n(p) = \sqrt{n}(G_n(p) - G(p))$ proved in
Bharath, Pozdnyakov and Dey (2012).

Theorem 1. Define

$$\theta_p = \frac{1}{p}\,W_1 I_{W_1 < Q(p)} - \frac{1}{p}\,Q(p)\,I_{W_1 < Q(p)} + \frac{1}{1-p}\,W_1 I_{W_1 > Q(p)} - \frac{1}{1-p}\,Q(p)\,I_{W_1 > Q(p)} + \frac{2\,I_{W_1 < Q(p)}}{f(Q(p))}.$$

Under assumptions A1-A3,

$$U_n(p) \Rightarrow U(p),$$

in the Skorohod space $D[a, b]$, $0 < a < b < 1$, equipped with the $J_1$ topology, where $U(p)$ is
a Gaussian process with mean 0 and covariance function given by

$$C(p, q) = \mathrm{Cov}(U(p), U(q)) = \mathrm{Cov}(\theta_p, \theta_q). \tag{3.1}$$

The next lemma states that the Gaussian process $U(p)$ allows a continuous modification.
This fact will be employed, for example, to justify the usage of the mapping theorem
(for instance, Billingsley (1968) and Pollard (1984)).

Lemma 1. Under assumptions A1-A3, the centered Gaussian process $U(p)$, $a \le p \le b$,
with covariance function in (3.1) is continuous.

Proof. First note that on the interval $[a, b]$ the functions (of $p$) $1/p$, $1/(1 - p)$, $Q(p)$,
$1/f(Q(p)) = Q'(p)$, $E[W_1 I_{W_1 < Q(p)}]$ and $E[W_1^2 I_{W_1 < Q(p)}]$ are continuously differentiable.
Second, the functions $\max(p, q)$ and $\min(p, q)$ are (globally) Lipschitz continuous on
$[a, b] \times [a, b]$. Therefore, the covariance function $C(p, q)$ is Lipschitz continuous on
$[a, b] \times [a, b]$; i.e., there is a constant $K$ such that for all $p$, $q$, $p'$ and $q'$ from $[a, b]$

$$|C(p, q) - C(p', q')| \le K(|p - p'| + |q - q'|).$$

Therefore,

$$E[U(p) - U(q)]^2 = C(p, p) + C(q, q) - 2C(p, q) \le |C(p, q) - C(p, p)| + |C(p, q) - C(q, q)| \le 2K|p - q|.$$

By Theorem 1.4 from Adler (1990) we get that $U(p)$ is continuous. □

This immediately leads us to the following important consequence.

Corollary 1. Under assumptions A1-A3, as $n \to \infty$,

$$\sup_{a \le p \le b} |G_n(p) - G(p)| \xrightarrow{P} 0.$$

Proof. Since the functional $\sup_{p \in [a, b]} |x(p)|$ is continuous on $C[a, b]$ equipped with the
uniform metric, and the process $U(p)$ is continuous (that is, $U(p) \in C[a, b]$ with probability
1), by the mapping theorem (Pollard (1984), p. 70) we have

$$\sup_{a \le p \le b} \sqrt{n}\,|G_n(p) - G(p)| \Rightarrow \sup_{a \le p \le b} |U(p)|.$$

Therefore,

$$\sup_{a \le p \le b} |G_n(p) - G(p)| = \frac{1}{\sqrt{n}} \sup_{a \le p \le b} \sqrt{n}\,|G_n(p) - G(p)| \xrightarrow{P} 0.$$

□

The empirical cross-over function is a step function. The next lemma tells us that the
jump at any $p \in (0, 1)$ is $o_p(1/\sqrt{n})$.



imsart-ejs ver. 2011/11/15 file: clusters0320.tex date: April 2, 2013 



/Clustering Criterion 



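Lemma 2. Under assumptions A1-A3, for $0 < p < 1$ and $\frac{k-1}{n} \le p < \frac{k}{n}$, as $n \to \infty$,

$$G_n\!\left(\frac{k-1}{n}\right) - G_n\!\left(\frac{k-2}{n}\right) = o_p\!\left(\frac{1}{\sqrt{n}}\right).$$

Proof.

$$G_n\!\left(\frac{k-1}{n}\right) - G_n\!\left(\frac{k-2}{n}\right) = \left[\frac{1}{k}\sum_{j=1}^{k} W_{(j)} - W_{(k)} + \frac{1}{n-k}\sum_{j=k+1}^{n} W_{(j)} - W_{(k+1)}\right] - \left[\frac{1}{k-1}\sum_{j=1}^{k-1} W_{(j)} - W_{(k-1)} + \frac{1}{n-k+1}\sum_{j=k}^{n} W_{(j)} - W_{(k)}\right].$$

Re-arranging terms, the RHS can be written as

$$-\frac{\sum_{j=1}^{k-1} W_{(j)}}{k(k-1)} + \frac{\sum_{j=k+1}^{n} W_{(j)}}{(n-k)(n-k+1)} + W_{(k)}\,\frac{n - 2k + 1}{k(n-k+1)} + \left(W_{(k-1)} - W_{(k+1)}\right).$$

Observe that, by the law of large numbers for trimmed sums (see Stigler (1973)),

$$\frac{1}{n}\sum_{i=1}^{k-1} W_{(i)} \xrightarrow{P} \eta,$$

where $\eta$ is a constant, and hence, as $n \to \infty$,

$$\frac{1}{k(k-1)}\sum_{i=1}^{k-1} W_{(i)} = O_p\!\left(\frac{1}{n}\right),$$

and similarly,

$$\frac{1}{(n-k)(n-k+1)}\sum_{i=k+1}^{n} W_{(i)} = O_p\!\left(\frac{1}{n}\right).$$

Moreover, since $W_{(k)} \xrightarrow{P} Q(p)$ for $\frac{k-1}{n} \le p < \frac{k}{n}$, we have that

$$W_{(k)}\,\frac{n - 2k + 1}{k(n-k+1)} = O_p\!\left(\frac{1}{n}\right).$$

Suppose $M_n = \sup_{1 < k \le n} (W_{(k)} - W_{(k-1)})$; then from Devroye (1981), we have that

$$M_n = O_p\!\left(\frac{\log n}{n}\right),$$

and therefore

$$W_{(k-1)} - W_{(k+1)} = o_p\!\left(\frac{1}{\sqrt{n}}\right).$$

It is hence the case that the RHS is $o_p(1/\sqrt{n})$. This concludes the proof. □

Now, we are ready to prove consistency of $p_n$. As in Hartigan (1978) (Theorem 1) we
require a uniqueness condition.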

Theorem 2. Assume A1-A3 hold. Suppose that $G(p) = 0$ has a unique solution, $p_0$.
Then for any $0 < a < p_0 < b < 1$

$$p_n \xrightarrow{P} p_0,$$

as $n \to \infty$.

Proof. Note that by the Cauchy-Schwarz inequality we have

$$\left[E\,W_1 I_{W_1 > Q(p)}\right]^2 \le E\left[W_1^2 I_{W_1 > Q(p)}\right] P[W_1 > Q(p)].$$

Since the second moment of $W_1$ is finite, we obtain that $B(Q, 0+) = B(Q, 1-) = 0$.
That is, a nonnegative continuously differentiable split function $B(Q, p)$ has a unique
maximum at $p_0$, and, as a result, $G(p)$ does change sign at $p_0$. Choose $a$, $b$ such that
$0 < a < p_0 < b < 1$. Because $G$ is continuous on $[a, b]$ we get that $G(p) > 0$ for $a \le p < p_0$
and $G(p) < 0$ for $b \ge p > p_0$. Moreover, for any $\delta > 0$ there exist an $\epsilon > 0$ and $0 < \delta' < \delta$
such that

$$G(p) > \epsilon \quad \text{for } a \le p \le p_0 - \delta',$$

and

$$G(p) < -\epsilon \quad \text{for } b \ge p \ge p_0 + \delta'.$$

By Corollary 1, as $n \to \infty$,

$$P\left(\sup_{a \le p \le b} |G_n(p) - G(p)| < \frac{\epsilon}{2}\right) \to 1,$$

and therefore,

$$P\left(\inf_{a \le p \le p_0 - \delta'} G_n(p) > \frac{\epsilon}{2} \ \text{ and } \sup_{b \ge p \ge p_0 + \delta'} G_n(p) < -\frac{\epsilon}{2}\right) \to 1.$$

Using the result from Lemma 2, we obtain

$$P(p_n \in [p_0 - \delta', p_0 + \delta']) \to 1.$$

Note that since

$$P(p_n \in [p_0 - \delta, p_0 + \delta]) \ge P(p_n \in [p_0 - \delta', p_0 + \delta']),$$

we finally have

$$P(p_n \in [p_0 - \delta, p_0 + \delta]) \to 1.$$

□



Now, under an additional assumption that $G'(p_0) < 0$ (cf. Theorem 2 from Hartigan
(1978)) we will establish asymptotic normality of $p_n$. This result will be proved in
three steps. First, we will establish that $p_n$ is in the $O_p(1/\sqrt{n})$ neighborhood of $p_0$. Then
we will show that in this neighborhood $G_n(p)$ can be adequately approximated by a line
with slope $G'(p_0)$. Finally, an approach based on Bahadur's general method (see p. 95,
Serfling (1980)) will be employed to get the CLT for $p_n$.

Lemma 3. Assume A1-A3 hold. Suppose that $G(p) = 0$ has a unique solution, $p_0$, and
$G'(p_0) < 0$. If $a$, $b$ are such that $0 < a < p_0 < b < 1$, then for any $\delta > 0$ there exist $N$
and $C > 0$ such that for all $n > N$

$$P\left(|p_n - p_0| \le \frac{C}{\sqrt{n}}\right) > 1 - \delta.$$

Proof. Fix arbitrary $\delta > 0$. Using Theorem 1 and the mapping theorem we have

$$\sup_{a \le p \le b} \sqrt{n}\,|G_n(p) - G(p)| \Rightarrow \sup_{a \le p \le b} |U(p)|.$$

Therefore, for any $\delta > 0$ there exist $N'$ and $C' > 0$ such that for all $n > N'$ we have

$$P\left(\sup_{a \le p \le b} |G_n(p) - G(p)| < \frac{C'}{\sqrt{n}}\right) > 1 - \delta. \tag{3.2}$$

By the same argument as in Theorem 2, $p_0$ is a unique split point, $0 < p_0 < 1$, $G(p) > 0$
for $p < p_0$ and $G(p) < 0$ for $p > p_0$. Assumption $G'(p_0) < 0$ tells us that in a neighborhood
of $p_0$ the function $G(p)$ behaves like a line. Taking this into account we get that there
exist $N > N'$ and $C > 0$ such that for all $n > N$

$$G(p) > \frac{2C'}{\sqrt{n}} \quad \text{for } a \le p \le p_0 - \frac{C}{\sqrt{n}},$$

and

$$G(p) < -\frac{2C'}{\sqrt{n}} \quad \text{for } b \ge p \ge p_0 + \frac{C}{\sqrt{n}}.$$

Then by (3.2) we find that for all $n > N$

$$P\left(\inf_{a \le p \le p_0 - C/\sqrt{n}} G_n(p) > \frac{C'}{\sqrt{n}} \ \text{ and } \sup_{b \ge p \ge p_0 + C/\sqrt{n}} G_n(p) < -\frac{C'}{\sqrt{n}}\right) > 1 - \delta.$$

Therefore,

$$P\left(|p_n - p_0| \le \frac{C}{\sqrt{n}}\right) > 1 - \delta.$$

□



Lemma 4. Assume A1-A3 hold. Suppose that $G(p) = 0$ has a unique solution, $p_0$, and
$G'(p_0) < 0$. Then for any $C > 0$

$$\sup_{p \in I_n} \sqrt{n}\,|G_n(p) - G_n(p_0) - G'(p_0)(p - p_0)| \xrightarrow{P} 0, \quad \text{as } n \to \infty,$$

where $I_n = \left[p_0 - \frac{C}{\sqrt{n}},\, p_0 + \frac{C}{\sqrt{n}}\right]$, and

$$G'(p_0) = \frac{1}{p_0}\,[Q(p_0) - Q_l(p_0)] - \frac{1}{1 - p_0}\,[Q(p_0) - Q_u(p_0)] - 2Q'(p_0). \tag{3.3}$$

Proof. Since the second derivative of $G(p)$ is uniformly continuous on
$p_0 - C/\sqrt{n} \le p \le p_0 + C/\sqrt{n}$ we have

$$G(p) - G(p_0) = (p - p_0)\,G'(p_0) + O((p - p_0)^2) = (p - p_0)\,G'(p_0) + O(1/n).$$

It is hence sufficient to show that

$$\sup_{p \in I_n} \sqrt{n}\,\big|[G_n(p) - G(p)] - [G_n(p_0) - G(p_0)]\big| \xrightarrow{P} 0,$$

or that for any $\epsilon > 0$ and $\delta > 0$ there exists $N$ such that for all $n > N$

$$P\left(\sup_{p \in I_n} \sqrt{n}\,\big|[G_n(p) - G(p)] - [G_n(p_0) - G(p_0)]\big| > \epsilon\right) < \delta.$$

Take arbitrary $\delta' > 0$. The functional $\sup_{p \in [p_0 - \delta', p_0 + \delta']} |x(p) - x(p_0)|$ is continuous on
$C[a, b]$ equipped with the uniform metric. Therefore, Theorem 1 and the mapping theorem
inform us that

$$\sup_{p_0 - \delta' \le p \le p_0 + \delta'} \sqrt{n}\,\big|[G_n(p) - G(p)] - [G_n(p_0) - G(p_0)]\big| \Rightarrow \sup_{p_0 - \delta' \le p \le p_0 + \delta'} |U(p) - U(p_0)|.$$

Since for all sufficiently large $n$ we have

$$\sup_{p \in I_n} \sqrt{n}\,\big|[G_n(p) - G(p)] - [G_n(p_0) - G(p_0)]\big| \le \sup_{p_0 - \delta' \le p \le p_0 + \delta'} \sqrt{n}\,\big|[G_n(p) - G(p)] - [G_n(p_0) - G(p_0)]\big| \quad \text{a.s.},$$

it is indeed the case that

$$P\left(\sup_{p \in I_n} \sqrt{n}\,\big|[G_n(p) - G(p)] - [G_n(p_0) - G(p_0)]\big| > \epsilon\right) \le P\left(\sup_{p_0 - \delta' \le p \le p_0 + \delta'} \sqrt{n}\,\big|[G_n(p) - G(p)] - [G_n(p_0) - G(p_0)]\big| > \epsilon\right).$$

As a consequence,

$$\limsup_{n \to \infty} P\left(\sup_{p \in I_n} \sqrt{n}\,\big|[G_n(p) - G(p)] - [G_n(p_0) - G(p_0)]\big| > \epsilon\right) \le P\left(\sup_{p_0 - \delta' \le p \le p_0 + \delta'} |U(p) - U(p_0)| > \epsilon\right).$$

Because the Gaussian process $U(p)$ is continuous, $\sup_{p_0 - \delta' \le p \le p_0 + \delta'} |U(p) - U(p_0)| \to 0$
with probability 1 as $\delta' \to 0$, and, therefore, it converges to 0 in probability. Choosing $\delta'$
small enough we can make

$$\limsup_{n \to \infty} P\left(\sup_{p \in I_n} \sqrt{n}\,\big|[G_n(p) - G(p)] - [G_n(p_0) - G(p_0)]\big| > \epsilon\right) < \delta.$$

□

Lemma 5. Assume A1-A3 hold. Suppose that $G(p) = 0$ has a unique solution, $p_0$, and
$G'(p_0) < 0$. If $a$, $b$ are such that $0 < a < p_0 < b < 1$, then as $n \to \infty$

$$p_n = p_0 - \frac{G_n(p_0)}{G'(p_0)} + o_p(n^{-1/2}),$$

where $G'(p_0)$ is as defined in (3.3).

Proof. Consider the line $G_n(p_0) + G'(p_0)(p - p_0)$. Let the random variable $p^*$ be the solution
of

$$G_n(p_0) + G'(p_0)(p - p_0) = 0,$$

that is,

$$p^* = p_0 - \frac{G_n(p_0)}{G'(p_0)}. \tag{3.4}$$

From Theorem 1, we know that

$$G_n(p_0) - G(p_0) = G_n(p_0) = O_p(n^{-1/2}),$$

and we hence have that $p^* = p_0 + O_p(n^{-1/2})$. By Lemma 3, $p_n$ is also in an $O_p(1/\sqrt{n})$
neighborhood of $p_0$. By Lemma 4, uniformly in an $O_p(1/\sqrt{n})$ neighborhood of $p_0$,

$$G_n(p) = G_n(p_0) + G'(p_0)(p - p_0) + o_p(n^{-1/2}),$$

by which we can claim that $p_n = p^* + o_p(n^{-1/2})$. The result follows by substituting for
$p^*$ from (3.4). □

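This immediately gives us the final result.

Theorem 3. Assume A1-A3 hold. Suppose that $G(p) = 0$ has a unique solution, $p_0$,
and $G'(p_0) < 0$. If $a$, $b$ are such that $0 < a < p_0 < b < 1$, then as $n \to \infty$,

$$\sqrt{n}\,(p_n - p_0) \Rightarrow N\!\left(0, \frac{\mathrm{Var}(\theta_{p_0})}{G'^2(p_0)}\right),$$

where $\theta_{p_0}$ is as defined in Theorem 1.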
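For the reader's convenience, the step from Lemma 5 to Theorem 3 can be spelled out. The display below is a sketch that uses only the statements above, together with $G(p_0) = 0$, so that $U_n(p_0) = \sqrt{n}\,G_n(p_0)$:

$$\sqrt{n}\,(p_n - p_0) = -\frac{\sqrt{n}\,G_n(p_0)}{G'(p_0)} + o_p(1) = -\frac{U_n(p_0)}{G'(p_0)} + o_p(1) \Rightarrow -\frac{U(p_0)}{G'(p_0)}.$$

By Theorem 1 the limit $U(p_0)$ is centered Gaussian with variance $C(p_0, p_0) = \mathrm{Var}(\theta_{p_0})$, so the right-hand side is normal with mean 0 and variance $\mathrm{Var}(\theta_{p_0})/G'^2(p_0)$.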



3.1. Numerical Verification 

In this section we provide verification of our results regarding the asymptotic normality
of $p_n$ along the lines of Table 1 in Hartigan (1978). Since our split point $p_0$ coincides with
Hartigan's split point (maximum of $B(Q, p)$), it is to be expected that our empirical split
point $p_n$ behaves asymptotically like his. Hartigan verifies his results when observations
are obtained from a $N(0, 1)$ population, a population with smooth distribution
function; we do the same and note that the asymptotic mean and the variance of $p_n$ agree
with his.

It is a simple exercise to ascertain that, for the normal case, the split point $p_0$ is 0.5,
$G'^2(0.5) \approx 3.32$ and $\mathrm{Var}(\theta_{0.5}) = 2\pi - 4 \approx 2.283$. Consequently, we observe that the
asymptotic variance of $\sqrt{n}(p_n - 0.5)$ is approximately 0.69. The table below corroborates
our theoretical results.
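The constants for the normal case can be checked with a few lines of Python. This is a sketch under stated assumptions: it evaluates $G$ and the derivative formula (3.3) for $N(0, 1)$ using the standard identities $E[W_1 I_{W_1 < q}] = -\varphi(q)$ and $E[W_1 I_{W_1 > q}] = \varphi(q)$, takes $\mathrm{Var}(\theta_{0.5}) = 2\pi - 4$ as quoted in the text, and uses the hypothetical function name `normal_cross_over`.

```python
import math
from statistics import NormalDist


def normal_cross_over(p):
    """Return (G(p), G'(p)) for the standard normal distribution."""
    nd = NormalDist()
    q = nd.inv_cdf(p)          # Q(p)
    phi = nd.pdf(q)            # f(Q(p)); note Q'(p) = 1 / f(Q(p))
    q_l = -phi / p             # Q_l(p) = E[W 1{W < q}] / p
    q_u = phi / (1.0 - p)      # Q_u(p) = E[W 1{W > q}] / (1 - p)
    g = q_l + q_u - 2.0 * q
    g_prime = (q - q_l) / p - (q - q_u) / (1.0 - p) - 2.0 / phi
    return g, g_prime


g_half, g_prime_half = normal_cross_over(0.5)
var_theta_half = 2.0 * math.pi - 4.0            # value quoted in the text
asymptotic_variance = var_theta_half / g_prime_half ** 2
```

Under these assumptions `g_half` vanishes, confirming that 0.5 is a zero of the cross-over function, and `asymptotic_variance` comes out close to the value 0.69 used for comparison with the simulations.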






Table 1
Simulated mean and variance of $\sqrt{n}(p_n - 0.5)$ for different sample sizes for the normal case. 1000
simulations were performed for each sample size.

Sample size           n = 100    n = 300    n = 500    n = 1000
Simulated Mean        0.506      0.504      0.501      0.502
Simulated Variance    0.614      0.646      0.700      0.691


3.2. An Example: Confidence Interval Estimation 

We demonstrate here how Theorem 3 can be employed to construct approximate confidence
intervals (CI) for a theoretical split point. We consider a classical example of a
bimodal distribution: the variable "eruptions" in the data set faithful, available in the R
package datasets. The data set contains 272 measurements of the duration of eruption for
the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.

First, we plot the ECF for the variable "eruptions"; the plot is given in Figure 1. We
can see that $G_n(\cdot)$ is generally a decreasing function that crosses the zero line once, far away
from 0 and 1, the end-points of its domain, the interval (0, 1). Thus our point
estimate of the theoretical split point is $p_n = 97/272 \approx .357$.

Fig 1. Empirical Cross-over Function $G_n(p)$ for data set faithful (horizontal axis: $p$, from 0.0 to 1.0).

Now, to construct an approximate CI for $p_0$ we need to estimate $\mathrm{Var}(\theta_{p_0})/G'^2(p_0)$.
A straightforward (but rather tedious) calculation shows that this quantity explicitly
depends on the following terms: $p_0$, $Q(p_0)$, $f(Q(p_0))$, $Q_l(p_0)$, $Q_u(p_0)$,

$$B_l(p_0) = \frac{1}{p_0}\,E\left[W_1^2 I_{W_1 < Q(p_0)}\right], \quad \text{and} \quad B_u(p_0) = \frac{1}{1 - p_0}\,E\left[W_1^2 I_{W_1 > Q(p_0)}\right].$$

We estimate these terms as follows:

$$p_0 \approx p_n, \qquad Q(p_0) \approx W_{(98)},$$

$$Q_l(p_0) \approx \frac{1}{98}\sum_{i=1}^{98} W_{(i)}, \qquad Q_u(p_0) \approx \frac{1}{272 - 98}\sum_{i=99}^{272} W_{(i)},$$

$$B_l(p_0) \approx \frac{1}{98}\sum_{i=1}^{98} W_{(i)}^2, \qquad B_u(p_0) \approx \frac{1}{272 - 98}\sum_{i=99}^{272} W_{(i)}^2.$$

Finally, $f(Q(p_0))$ is estimated by $\hat{f}(W_{(98)})$, where $\hat{f}$ comes from the standard R function
density. As a result, for instance, the 95% confidence interval for a theoretical split point
$p_0$ is given by

$$.357 \pm .057.$$
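The construction can be sketched in Python on synthetic data. Several choices here are assumptions rather than the authors' procedure: the variance of $\theta_{p_0}$ is estimated by the sample variance of plug-in values of the influence-type term (lower and upper trimmed parts centered at the estimated quantile, plus the density term) instead of the explicit formula in terms of $B_l$ and $B_u$; the density is estimated with a Gaussian kernel and a rule-of-thumb bandwidth; the data are a simulated two-component normal mixture, not the faithful data; and the names `gaussian_kde_at` and `split_point_ci` are hypothetical.

```python
import math
import random
import statistics
from itertools import accumulate


def gaussian_kde_at(x, data, bandwidth):
    """Gaussian kernel density estimate at a single point x."""
    c = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))
    return c * sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data)


def split_point_ci(sample, a=0.1, b=0.9, z=1.96):
    """Return (p_n, half_width) of an approximate CI based on Theorem 3."""
    w = sorted(sample)
    n = len(w)
    prefix = [0.0] + list(accumulate(w))

    def g(k):  # G_n((k-1)/n) as in (2.4)
        return prefix[k] / k - w[k - 1] + (prefix[n] - prefix[k]) / (n - k) - w[k]

    crossings = [k for k in range(math.ceil(n * a), math.floor(n * b) + 1)
                 if g(k) * g(k + 1) <= 0]
    if not crossings:
        raise ValueError("no zero crossing of the ECF in the range")
    p_n = max(crossings) / n
    k = max(crossings) + 1                 # index used for the plug-in estimates
    q = w[k - 1]                           # estimate of Q(p_0)
    q_l = prefix[k] / k                    # estimate of Q_l(p_0)
    q_u = (prefix[n] - prefix[k]) / (n - k)  # estimate of Q_u(p_0)
    q1, _, q3 = statistics.quantiles(w, n=4)
    # rule-of-thumb bandwidth (an assumption, not taken from the paper)
    bandwidth = 0.9 * min(statistics.stdev(w), (q3 - q1) / 1.34) * n ** (-0.2)
    f_q = gaussian_kde_at(q, w, bandwidth)
    g_prime = (q - q_l) / p_n - (q - q_u) / (1.0 - p_n) - 2.0 / f_q   # (3.3)
    theta = [(x - q) / p_n + 2.0 / f_q if x <= q else (x - q) / (1.0 - p_n)
             for x in w]
    var_theta = statistics.pvariance(theta)
    half_width = z * math.sqrt(var_theta / (n * g_prime ** 2))
    return p_n, half_width


# Synthetic bimodal sample standing in for real data (illustration only).
rng = random.Random(1)
sample = ([rng.gauss(2.0, 0.3) for _ in range(100)]
          + [rng.gauss(4.3, 0.4) for _ in range(170)])
p_hat, half = split_point_ci(sample)
```

When the two clusters are well separated, the density at the split is small, the term $2Q'(p_0)$ dominates both $G'(p_0)$ and $\theta_{p_0}$, and the variance ratio is close to $p_0(1 - p_0)$; this makes the interval width easy to sanity-check.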



4. Discussion 



Admittedly, the definition of the empirical split point "in the range [a, b]" might appear
a bit artificial. But we still believe the results can be useful in practical applications. For 
instance, as with the faithful data, if we know that the distribution at hand is bimodal, 
and we want to estimate a split point between two clusters, it is safe to assume that the 
split point is in the range between two modes. 

It turns out that the behavior of the cross-over function when $p$ is close to 0 or
1 can be rather complicated; under some natural assumptions, it can be shown that
$\limsup_{p \to 1} G(p) < 0$. For example, it is true if $W_1$ is bounded from above. When $Q(1-) =
+\infty$, the following condition

is sufficient for $\limsup_{p \to 1} G(p) < 0$. It is easy to see that, for instance, distributions with
regularly varying tails and $E W_1^{2+\epsilon} < \infty$, for $\epsilon > 0$, satisfy (4.1). However, it is possible to
construct a distribution with a "bumpy" tail for which $\limsup_{p \to 1} G(p) > 0$. Consequently,
this suggests that any extension of the definition of $p_n$ to the entire interval (0, 1) will require
some additional assumptions.
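To illustrate the role of the tail, consider as a worked example (an assumption made here for illustration, not a case treated in the paper) a distribution whose upper tail is exactly of Pareto type, $P(W_1 > x) = c\,x^{-\alpha}$ for all large $x$, with $\alpha > 1$. Then the mean excess is linear in $x$:

$$E[W_1 \mid W_1 > x] = \frac{\alpha}{\alpha - 1}\,x, \qquad \text{so that} \qquad Q_u(p) = \frac{\alpha}{\alpha - 1}\,Q(p) \quad \text{for } p \text{ close to } 1.$$

Since $Q_l(p) \to E W_1 = 0$ as $p \to 1$, the cross-over function behaves like

$$G(p) = Q_l(p) + Q_u(p) - 2Q(p) = Q_l(p) + \frac{2 - \alpha}{\alpha - 1}\,Q(p),$$

which tends to $-\infty$ when $\alpha > 2$ because $Q(p) \to +\infty$. The requirement $\alpha > 2$ corresponds to a finite moment of order slightly above two, in line with the moment condition mentioned above.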
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